The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network
Jeffrey Pennington and Pratik Worah
An important factor contributing to the success of deep learning has been the remarkable ability to optimize large neural networks using simple first-order optimization algorithms like stochastic gradient descent. While the efficiency of such methods depends crucially on the local curvature of the loss surface, very little is actually known about how this geometry depends on network architecture and hyperparameters. In this work, we extend a recently-developed framework for studying spectra of nonlinear random matrices to characterize an important measure of curvature, namely the eigenvalues of the Fisher information matrix. We focus on a single-hidden-layer neural network with Gaussian data and weights and provide an exact expression for the spectrum in the limit of infinite width. We find that linear networks suffer worse conditioning than nonlinear networks and that nonlinear networks are generically non-degenerate. We also predict and demonstrate empirically that by adjusting the nonlinearity, the spectrum can be tuned so as to improve the efficiency of first-order optimization methods.
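As a rough numerical companion to the abstract, the sketch below estimates the spectrum of one block of the Fisher information matrix. It assumes a scalar-output network f(x) = v^T sigma(W x) with squared loss under a Gaussian output model, so that the per-example gradient with respect to the output weights v is simply the hidden activation vector and the empirical Fisher block reduces to the Gram matrix of the activations; the widths, sample count, and activation functions are illustrative choices, not values taken from the paper.

import numpy as np

def fisher_output_block_spectrum(activation, n0=1000, n1=1000, m=2000, seed=0):
    # Gaussian data (one column per example) and Gaussian first-layer weights,
    # matching the setting described in the abstract only in distributional form.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n0, m))
    W = rng.standard_normal((n1, n0)) / np.sqrt(n0)
    Y = activation(W @ X)          # hidden activations, shape (n1, m)
    F = (Y @ Y.T) / m              # empirical Fisher block for the output weights
    return np.linalg.eigvalsh(F)

# Compare the conditioning of a linear network with a nonlinear one.
eigs_linear = fisher_output_block_spectrum(lambda z: z)
eigs_tanh = fisher_output_block_spectrum(np.tanh)
print("linear eigenvalue range:", eigs_linear.min(), eigs_linear.max())
print("tanh   eigenvalue range:", eigs_tanh.min(), eigs_tanh.max())

One can compare the spread of the two empirical spectra against the abstract's claim that linear networks suffer worse conditioning than nonlinear ones.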
The Upper Bound on Knots in Neural Networks
In recent years, neural networks--and deep neural networks in particular--have succeeded in such a wide range of data-driven problems as to herald a paradigm shift in the way data science is approached. Many everyday computerized tasks--such as image and optical character recognition, the personalization of Internet search results and advertisements, and even playing games such as chess, backgammon, and Go--have been deeply impacted and vastly improved by the application of neural networks. The applications of neural networks, however, have advanced significantly more rapidly than the theoretical understanding of their successes. Elements of neural network structure--such as the division of vector spaces into convex polytopes, and the application of nonlinear activation functions--afford neural networks great flexibility to model many classes of functions with spectacular accuracy. This flexibility is embodied in universal approximation theorems (Cybenko 1989; Hornik et al. 1989; Hornik 1991; Sonoda and Murata 2015), which essentially state that neural networks can model any continuous function arbitrarily well. The complexity of neural networks, however, has also made their analytical understanding somewhat elusive. The general thrust of this paper, as well as of two companion papers (Chen et al. 2016b,a), is to explore some unsolved elements of neural network theory, and to do so in a way that is independent of specific problems. In the broadest sense, we seek to understand what models neural networks are capable of producing. There exist many variations of neural networks, such as convolutional neural networks, recurrent neural networks, and long short-term memory models, each with its own arenas of success.
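As a small, hedged illustration of the universal-approximation flexibility mentioned above (the target function, hidden width, and random-feature construction are arbitrary choices made for the demo, not taken from any of the cited papers), the following sketch fits only the output weights of a single-hidden-layer tanh network by least squares and checks how closely it tracks a smooth one-dimensional target.

import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_points = 200, 400

x = np.linspace(-np.pi, np.pi, n_points)[:, None]   # inputs, shape (n_points, 1)
target = np.sin(3 * x).ravel()                       # a continuous target function

W = 3.0 * rng.standard_normal((1, n_hidden))         # random input-to-hidden weights
b = rng.uniform(-np.pi, np.pi, size=n_hidden)        # random hidden biases
H = np.tanh(x @ W + b)                               # hidden features, (n_points, n_hidden)

v, *_ = np.linalg.lstsq(H, target, rcond=None)       # fit output weights by least squares
approx = H @ v

print("max absolute error:", np.max(np.abs(approx - target)))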